Abstract

This paper presents a kernel formulation of the recently introduced
diff-hash algorithm for the construction of similarity-sensitive hash
functions. Our kernel diff-hash algorithm shows superior performance
on the problem of image feature descriptor matching.

1 Introduction 

The need to represent data efficiently in a compact and convenient way has
led to similarity-sensitive hashing methods, first considered in [H] and
later in (3J EH [T7J [321 [22]. Similarity-sensitive hashing methods can be
regarded as a particular
instance of supervised metric learning [21 [31], where one tries to construct 
a hashing function on the data space that preserves known similarity on 
the training set. Typically, the similarity is binary and can be related to 
hash collision probability (similar points should collide, and dissimilar points 
should not collide). Such methods have been enjoying increasing popularity 
in the computer vision and pattern recognition community in image analysis 
and retrieval [131 123 HH EE21 EH HE], video copy detection [5], and shape 
retrieval [TJ. 

Shakhnarovich [24] considered parametric hashing functions with an affine
transformation of the data vectors (projection matrix and threshold vector)
followed by the sign function. He posed the problem of similarity-sensitive
hash construction as boosted classification, where each dimension of the hash 
acts as a weak binary classifier. The parameters of the hashing function were 
learned using AdaBoost. In [25], we used the same setting of the problem 
and proposed a much simpler algorithm, wherein projections were selected 
as eigenvectors of the ratio or difference of covariance matrices of similar and 
dissimilar pairs of data points; the former method was dubbed as LDA-hash 
and the latter as diff-hash. Applying these methods to SIFT local features 
in images [18], very compact and accurate binary descriptors were produced. 

The inspiration for this paper is the diff-hash method [25]. While being
remarkably simple and efficient, this method suffers from two major
limitations. First, the length of the hash is limited by the descriptor
dimensionality. In some situations, this is a clear disadvantage, as longer
hashes allow more accurate matching. Second, the affine hashing functions
are in many cases too simple and fail to represent the structure of the data
correctly. In this paper, we propose a kernel formulation of the diff-hash
algorithm which efficiently resolves both problems. We evaluate the algorithm
on the problem of image descriptor matching using the patches dataset from
[33] and show that it outperforms the original diff-hash.

2 Background 

Let $X \subseteq \mathbb{R}^n$ denote the data space. We denote by
$\mathcal{P}$ the set of pairs of similar data points (positives) and by
$\mathcal{N}$ the set of pairs of dissimilar data points (negatives). The
problem of similarity-sensitive hashing is to represent the data in a common
space $\mathbb{H}^m = \{-1,+1\}^m$ of $m$-dimensional binary vectors with the
Hamming metric
$d_{\mathbb{H}^m}(a,b) = \frac{m}{2} - \frac{1}{2}\sum_{i=1}^{m} a_i b_i$
by means of a map $\xi : X \to \mathbb{H}^m$ such that
$d_{\mathbb{H}^m}\circ(\xi\times\xi)|_{\mathcal{P}} \approx 0$ and
$d_{\mathbb{H}^m}\circ(\xi\times\xi)|_{\mathcal{N}} \approx m$.
Alternatively, this can be expressed as having
$E\{d_{\mathbb{H}^m}\circ(\xi\times\xi)\,|\,\mathcal{P}\} \approx 0$
(i.e., the hash has high collision probability on the set of positives) and
$E\{d_{\mathbb{H}^m}\circ(\xi\times\xi)\,|\,\mathcal{N}\} \gg 0$. The former
can be interpreted as the false negative rate (FNR) and the latter as the
false positive rate (FPR).

2.1 Similarity-sensitive hashing (SSH)

To further simplify the problem, Shakhnarovich [24] considered parametric
hashing functions of the form $\xi(x) = \mathrm{sign}(Px + a)$, where $P$ is
an $m \times n$ projection matrix and $a$ is an $m \times 1$ threshold
vector. The similarity-sensitive hashing (SSH) algorithm considers the hash
construction as boosted binary classification, where each hash dimension
acts as a weak binary classifier. For each dimension, AdaBoost is used to
minimize the following loss function

$$\min \sum_{(x,x') \in \mathcal{P}\cup\mathcal{N}} w_i(x,x')\, s(x,x')\, \xi_i(x)\, \xi_i(x'), \tag{1}$$

where $\xi_i(x) = \mathrm{sign}(p_i^{\mathrm T} x + a_i)$, $s(x,x') = +1$ for
$(x,x') \in \mathcal{N}$ and $-1$ for $(x,x') \in \mathcal{P}$, and
$w_i(x,x')$ is the AdaBoost weight for the pair $(x,x')$ at the $i$th
iteration. Shakhnarovich [24] selected $p_i$ as the axis projection that
minimizes the objective. In [SJE], minimization problem (1) was relaxed in
the following way: first, removing the non-linearity and setting $a_i = 0$,
find the projection vector $p_i$. Then, fixing the projection $p_i$, find the
threshold $a_i$. The disadvantages of boosting-based SSH are, first, its high
computational complexity and, second, its tendency to find unnecessarily
long hashes.¹
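The parametric hash and the Hamming metric on $\{-1,+1\}^m$ can be illustrated with a minimal NumPy sketch. The function names `affine_hash` and `hamming_distance` are hypothetical, the projection matrix below is random rather than learned, and mapping a zero projection to $+1$ is an arbitrary tie-breaking assumption.

```python
import numpy as np

def affine_hash(X, P, a):
    """Codes in {-1, +1}^m for the rows of X, using xi(x) = sign(Px + a).

    X: (N, n) data, P: (m, n) projection matrix, a: (m,) threshold vector.
    A projection exactly equal to zero is mapped to +1 (tie-breaking assumption).
    """
    return np.where(X @ P.T + a >= 0, 1, -1)

def hamming_distance(u, v):
    """Hamming distance between +/-1 codes: m/2 - (1/2) * sum_i u_i v_i."""
    m = u.shape[-1]
    return (m - np.sum(u * v, axis=-1)) // 2

rng = np.random.default_rng(0)
P = rng.standard_normal((8, 4))   # placeholder projection, not a learned one
a = np.zeros(8)
X = rng.standard_normal((3, 4))
codes = affine_hash(X, P, a)
d = hamming_distance(codes[0], codes[1])
```

Since $\sum_i u_i v_i = m - 2k$ when the codes differ in $k$ positions, the expression returns exactly $k$.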

2.2 Diff-hash 

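In [25], we proposed a simpler approach, computing the similarity-sensitive
hashing by minimizing

$$\begin{aligned}
L(\xi) &= \alpha E\{d_{\mathbb{H}^m}\circ(\xi\times\xi)\,|\,\mathcal{P}\} - E\{d_{\mathbb{H}^m}\circ(\xi\times\xi)\,|\,\mathcal{N}\} \\
&= \frac{(\alpha-1)m}{2} + \frac{1}{2} E\{\xi^{\mathrm T}\xi'\,|\,\mathcal{N}\} - \frac{\alpha}{2} E\{\xi^{\mathrm T}\xi'\,|\,\mathcal{P}\}
\end{aligned} \tag{2}$$

w.r.t. the map $\xi$, where $\xi = \xi(x)$ and $\xi' = \xi(x')$. Problem (2)
is equivalent, up to constants, to minimizing the correlations

$$\begin{aligned}
L(P,a) &= E\{\mathrm{sign}(Px + a)^{\mathrm T}\,\mathrm{sign}(Px' + a)\,|\,\mathcal{N}\} \\
&\quad - \alpha E\{\mathrm{sign}(Px + a)^{\mathrm T}\,\mathrm{sign}(Px' + a)\,|\,\mathcal{P}\}
\end{aligned} \tag{3}$$

w.r.t. the projection matrix $P$ and threshold vector $a$. The first and
second terms in (3) can be thought of as FPR and FNR, respectively. The
parameter $\alpha$ controls the tradeoff between FPR and FNR. The limit case
$\alpha \gg 1$ effectively considers only the positive pairs, ignoring the
negative set.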

¹ The second problem can be partially resolved by using sequential
probability testing [6], which creates hashes of minimum expected length.

Problem (3) is a highly non-convex, non-linear optimization problem that is
difficult to solve directly. Following [51 [8] , we simplify the problem in
the following way. First, ignore the threshold and solve a simplified problem
without the sign non-linearity for the projection matrix $P$,

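$$\begin{aligned}
&\min_{P^{\mathrm T}P = I} E\{(Px)^{\mathrm T}(Px')\,|\,\mathcal{N}\} - \alpha E\{(Px)^{\mathrm T}(Px')\,|\,\mathcal{P}\} \\
&= \min_{P^{\mathrm T}P = I} \mathrm{tr}\left(P^{\mathrm T} E\{x x'^{\mathrm T}\,|\,\mathcal{N}\} P\right) - \alpha\,\mathrm{tr}\left(P^{\mathrm T} E\{x x'^{\mathrm T}\,|\,\mathcal{P}\} P\right) \\
&= \min_{P^{\mathrm T}P = I} \mathrm{tr}\left(P^{\mathrm T}(\Sigma_{\mathcal{N}} - \alpha\Sigma_{\mathcal{P}})P\right),
\end{aligned} \tag{4}$$

where $\Sigma_{\mathcal{P}}, \Sigma_{\mathcal{N}}$ denote the $n \times n$
covariance matrices of the positive and negative data. The solution of (4)
is given explicitly as
$P = [\lambda_{n-m+1}^{1/2} v_{n-m+1}, \dots, \lambda_{n}^{1/2} v_{n}]$,
the eigenvectors corresponding to the $m$ smallest eigenvalues of the matrix
$\Sigma_{\mathcal{N}} - \alpha\Sigma_{\mathcal{P}} = V\Lambda V^{\mathrm T}$
of weighted covariance differences.²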
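The projection step can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `diff_hash_projection` is hypothetical, the pair correlation matrices are estimated as sample averages of $x x'^{\mathrm T}$, the difference matrix is symmetrized before decomposition (only its symmetric part affects the trace objective), and the eigenvalue scaling of the eigenvectors is omitted.

```python
import numpy as np

def diff_hash_projection(Xp, Xp2, Xn, Xn2, m, alpha=1.0):
    """Return (P, evals): rows of P are unit eigenvectors of the symmetrized
    matrix Sigma_N - alpha * Sigma_P for its m smallest eigenvalues.

    Xp, Xp2: (Np, n) first and second points of the positive pairs.
    Xn, Xn2: (Nn, n) first and second points of the negative pairs.
    """
    Sp = Xp.T @ Xp2 / len(Xp)     # sample estimate of E{x x'^T | P}
    Sn = Xn.T @ Xn2 / len(Xn)     # sample estimate of E{x x'^T | N}
    D = Sn - alpha * Sp
    D = 0.5 * (D + D.T)           # symmetrize (assumption of this sketch)
    evals, evecs = np.linalg.eigh(D)   # eigh returns eigenvalues in ascending order
    return evecs[:, :m].T, evals[:m]

# Synthetic pairs, for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6))
Xp, Xp2 = X, X + 0.1 * rng.standard_normal((500, 6))   # similar pairs
Xn, Xn2 = X, rng.standard_normal((500, 6))             # dissimilar pairs
P, evals = diff_hash_projection(Xp, Xp2, Xn, Xn2, m=3, alpha=1.0)
```

Because at most $n$ eigenvectors exist, this construction cannot produce more than $n$ hash dimensions.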

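Second, fixing the projections, find the optimal threshold vector $a$,

$$\begin{aligned}
&\min_{a}\ E\{\mathrm{sign}(Px + a)^{\mathrm T}\,\mathrm{sign}(Px' + a)\,|\,\mathcal{N}\} - \alpha E\{\mathrm{sign}(Px + a)^{\mathrm T}\,\mathrm{sign}(Px' + a)\,|\,\mathcal{P}\} \\
&= \min_{\{a_i\}} \sum_{i=1}^{m} \Big( E\{\mathrm{sign}(p_i^{\mathrm T}x + a_i)\,\mathrm{sign}(p_i^{\mathrm T}x' + a_i)\,|\,\mathcal{N}\} \\
&\qquad\qquad - \alpha E\{\mathrm{sign}(p_i^{\mathrm T}x + a_i)\,\mathrm{sign}(p_i^{\mathrm T}x' + a_i)\,|\,\mathcal{P}\} \Big).
\end{aligned}$$

The problem is separable and can be solved independently in each dimension
$i$. The above terms correspond to the false positive and false negative
rates as functions of the threshold $a_i$,

$$\begin{aligned}
\mathrm{FNR}(a_i) &= \Pr(p_i^{\mathrm T}x + a_i < 0 \ \text{and}\ p_i^{\mathrm T}x' + a_i > 0\,|\,\mathcal{P}) \\
&\quad + \Pr(p_i^{\mathrm T}x + a_i > 0 \ \text{and}\ p_i^{\mathrm T}x' + a_i < 0\,|\,\mathcal{P})
\end{aligned}$$

and

$$\begin{aligned}
\mathrm{FPR}(a_i) &= \Pr(p_i^{\mathrm T}x + a_i < 0 \ \text{and}\ p_i^{\mathrm T}x' + a_i < 0\,|\,\mathcal{N}) \\
&\quad + \Pr(p_i^{\mathrm T}x + a_i > 0 \ \text{and}\ p_i^{\mathrm T}x' + a_i > 0\,|\,\mathcal{N}).
\end{aligned}$$

The above probabilities can be estimated from histograms (cumulative
distributions) of $p_i^{\mathrm T}x$ and $p_i^{\mathrm T}x'$ on the positive
and negative sets. The optimal threshold

$$a_i^{*} = \arg\min_{a}\ \alpha\,\mathrm{FNR}(a) + \mathrm{FPR}(a) \tag{5}$$

is obtained by means of one-dimensional exhaustive search.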
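The one-dimensional search can be sketched as below. The function name `select_threshold` and the choice of candidate thresholds (midpoints between consecutive sorted projection values, plus one value beyond each extreme) are assumptions of this sketch. It recomputes the empirical rates for every candidate, which is quadratic in the number of samples; the histogram-based estimate mentioned in the text would be faster on large training sets.

```python
import numpy as np

def select_threshold(pp, pp2, pn, pn2, alpha=1.0):
    """Exhaustive 1-D search for the offset a minimizing alpha*FNR(a) + FPR(a).

    pp, pp2: projections of the first/second points of the positive pairs.
    pn, pn2: projections of the first/second points of the negative pairs.
    """
    values = np.sort(np.concatenate([pp, pp2, pn, pn2]))
    mids = 0.5 * (values[:-1] + values[1:])
    # The hash bit is sign(p + a), so a cut at projection value t means a = -t.
    candidates = -np.concatenate([[values[0] - 1.0], mids, [values[-1] + 1.0]])
    best_a, best_cost = None, np.inf
    for a in candidates:
        fnr = np.mean((pp + a >= 0) != (pp2 + a >= 0))   # positives whose bits differ
        fpr = np.mean((pn + a >= 0) == (pn2 + a >= 0))   # negatives whose bits agree
        cost = alpha * fnr + fpr
        if cost < best_cost:
            best_a, best_cost = a, cost
    return best_a, best_cost

# Toy projections: positives lie on the same side of zero, negatives on opposite sides.
pp, pp2 = np.array([1.0, -1.0]), np.array([1.2, -0.8])
pn, pn2 = np.array([1.0, -0.9]), np.array([-1.0, 1.1])
a_star, cost = select_threshold(pp, pp2, pn, pn2, alpha=1.0)
```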



² The name of the algorithm, diff-hash, in fact refers to this covariance difference matrix.
3 Kernel diff-hash



An obvious disadvantage of diff-hash (and spectral methods in general)
compared to AdaBoost-based methods is that it must be dimensionality-reducing:
since we compute the projection $P$ as the eigenvectors of a covariance
matrix of size $n \times n$, the dimensionality of the embedding space must
be $m \le n$. This restriction is limiting in many cases: first, it depends
on the data dimensionality, and second, such a dimensionality may be too low,
so that a longer hash would achieve better performance. Furthermore, the
affine parametric form of the embedding $\xi$ is in many cases an
oversimplification, and some more generic map is required.

In this paper, we cope with both problems using a kernel formulation,
which transforms the data into some feature space that is never dealt with
explicitly (only inner products in this space, referred to as the kernel
[23], are required). In order to simplify the following discussion, since the
problem is separable (as we have seen, the projection in each dimension
corresponds to an eigenvector of the covariance difference matrix), we
consider one-dimensional projections. The whole method is summarized in
Algorithm 1.

3.1 Projection computation 

Let $k_X : X \times X \to \mathbb{R}$ be a positive semi-definite kernel, and
let $\phi : x \mapsto k_X(\cdot, x)$. Thus, $\phi$ maps the data into some
feature space, which we represent here as a Hilbert space $\mathcal{V}$
(possibly of infinite dimension) with an inner product
$\langle\cdot,\cdot\rangle_{\mathcal{V}}$, and satisfies
$k_X(x,x') = \langle k_X(\cdot,x), k_X(\cdot,x')\rangle_{\mathcal{V}} = \langle\phi(x),\phi(x')\rangle_{\mathcal{V}}$.

The idea of kernelization is to replace the original data $X$ with the
corresponding feature vectors $\phi(X)$, replacing the linear projection
$p^{\mathrm T}x$ with
$p(x) = \sum_{i=1}^{l} \beta_i \langle\phi(x_i),\phi(x)\rangle_{\mathcal{V}} = \beta^{\mathrm T}\,(k_X(x_1,x), \dots, k_X(x_l,x))^{\mathrm T}$.
Here, $\beta$ is a vector of unknown linear combination coefficients, and
$x_1, \dots, x_l$ denote some representative points in the data space.
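As an illustration of the kernelized projection, the sketch below evaluates $p(x) = \sum_i \beta_i k_X(x_i, x)$ for a batch of points. The isotropic Gaussian kernel, its width `gamma`, and the function names are assumptions chosen for brevity; any positive semi-definite kernel can be substituted.

```python
import numpy as np

def gaussian_kernel(A, C, gamma=1.0):
    """k(a, c) = exp(-gamma * ||a - c||^2) for all rows a of A and c of C."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(C**2, axis=1)[None, :] - 2.0 * A @ C.T
    return np.exp(-gamma * np.maximum(sq, 0.0))   # clip tiny negative round-off

def kernel_projection(beta, basis, X, kernel=gaussian_kernel):
    """p(x) = sum_i beta_i k(x_i, x) for every row x of X.

    beta: (l,) coefficients, basis: (l, n) representative points, X: (N, n).
    """
    return beta @ kernel(basis, X)    # (l,) @ (l, N) -> (N,)

basis = np.array([[0.0, 0.0], [1.0, 0.0]])   # two hypothetical representative points
beta = np.array([1.0, -1.0])
p = kernel_projection(beta, basis, basis)    # project the basis points themselves
```

With these coefficients the two basis points receive projections of equal magnitude and opposite sign, since $p(x_1) = 1 - k_X(x_1,x_2)$ and $p(x_2) = k_X(x_1,x_2) - 1$.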

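In this formulation, at the projection computation stage we minimize, for
each dimension,

$$\begin{aligned}
&\min_{\beta}\ \frac{1}{|\mathcal{N}|}\sum_{(x,x')\in\mathcal{N}} p(x)\,p(x') - \frac{\alpha}{|\mathcal{P}|}\sum_{(x,x')\in\mathcal{P}} p(x)\,p(x') \\
&= \min_{\beta}\ \frac{1}{|\mathcal{N}|}\sum_{(x,x')\in\mathcal{N}}\ \sum_{i,j=1}^{l} \beta_i\langle\phi(x_i),\phi(x)\rangle_{\mathcal{V}}\,\beta_j\langle\phi(x_j),\phi(x')\rangle_{\mathcal{V}} \\
&\qquad - \frac{\alpha}{|\mathcal{P}|}\sum_{(x,x')\in\mathcal{P}}\ \sum_{i,j=1}^{l} \beta_i\langle\phi(x_i),\phi(x)\rangle_{\mathcal{V}}\,\beta_j\langle\phi(x_j),\phi(x')\rangle_{\mathcal{V}} \\
&= \min_{\beta}\ \frac{1}{|\mathcal{N}|}\,\beta^{\mathrm T} K_{\mathcal{N}} K_{\mathcal{N}}'^{\mathrm T}\beta - \frac{\alpha}{|\mathcal{P}|}\,\beta^{\mathrm T} K_{\mathcal{P}} K_{\mathcal{P}}'^{\mathrm T}\beta = \min_{\beta}\ \beta^{\mathrm T} K \beta,
\end{aligned}$$

where $K_{\mathcal{N}}, K'_{\mathcal{N}}$ and $K_{\mathcal{P}}, K'_{\mathcal{P}}$
denote the $l \times |\mathcal{N}|$ and $l \times |\mathcal{P}|$ matrices
with elements $k_X(x_i, x)$ and $k_X(x_i, x')$, computed over the first and
second points of each pair, respectively. The optimal projection
coefficients $\beta$ minimizing this objective are given as the eigenvectors
corresponding to the smallest eigenvalues of the $l \times l$ matrix
$K = \frac{1}{|\mathcal{N}|} K_{\mathcal{N}} K_{\mathcal{N}}'^{\mathrm T} - \frac{\alpha}{|\mathcal{P}|} K_{\mathcal{P}} K_{\mathcal{P}}'^{\mathrm T}$.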

The kernel $k_X$ can be selected to account correctly for the structure of
the data space $X$. In our formulation, the dimensionality of the hash is
bounded by the number of basis vectors, $m \le l$, which is limited only by
the training set size and computational complexity.
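The following sketch illustrates how the kernel formulation lifts the bound on the hash length from the data dimensionality $n$ to the number of basis points $l$. It is a hedged illustration on synthetic data: the function names, the Gaussian kernel width, the use of separate kernel matrices for the first and second points of each pair, and the symmetrization of $K$ are all assumptions of this sketch.

```python
import numpy as np

def gauss(A, C, gamma=0.5):
    """Isotropic Gaussian kernel matrix between the rows of A and C (demo kernel)."""
    sq = ((A[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_diff_hash_coefficients(Kp, Kp2, Kn, Kn2, m, alpha=1.0):
    """Return (B, evals): rows of B are unit eigenvectors of the symmetrized
    l x l matrix K = Kn Kn2^T / |N| - alpha * Kp Kp2^T / |P| for its m
    smallest eigenvalues.

    Kp, Kp2: (l, |P|) kernel matrices for first/second points of positive pairs.
    Kn, Kn2: (l, |N|) kernel matrices for first/second points of negative pairs.
    """
    K = Kn @ Kn2.T / Kn.shape[1] - alpha * Kp @ Kp2.T / Kp.shape[1]
    K = 0.5 * (K + K.T)                  # symmetrize (assumption of this sketch)
    evals, evecs = np.linalg.eigh(K)     # ascending eigenvalues
    return evecs[:, :m].T, evals[:m]

rng = np.random.default_rng(0)
n, l, m, N = 4, 20, 10, 300              # m > n is possible as long as m <= l
X1 = rng.standard_normal((N, n))
X2_pos = X1 + 0.1 * rng.standard_normal((N, n))   # synthetic similar pairs
X2_neg = rng.standard_normal((N, n))              # synthetic dissimilar pairs
basis = X1[:l]                                    # representative points (assumption)
B, evals = kernel_diff_hash_coefficients(
    gauss(basis, X1), gauss(basis, X2_pos),
    gauss(basis, X1), gauss(basis, X2_neg), m)
```

Here a 10-dimensional hash is obtained from 4-dimensional data, which the linear construction does not allow.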



3.2 Threshold selection 

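As before, the threshold should be selected to minimize the false positive
and false negative rates, which can be expressed as

$$\begin{aligned}
\mathrm{FNR}(a) &= \Pr(p(x) + a < 0 \ \text{and}\ p(x') + a > 0\,|\,\mathcal{P}) \\
&\quad + \Pr(p(x) + a > 0 \ \text{and}\ p(x') + a < 0\,|\,\mathcal{P}), \\
\mathrm{FPR}(a) &= \Pr(p(x) + a < 0 \ \text{and}\ p(x') + a < 0\,|\,\mathcal{N}) \\
&\quad + \Pr(p(x) + a > 0 \ \text{and}\ p(x') + a > 0\,|\,\mathcal{N}).
\end{aligned}$$

The optimal threshold is obtained as

$$a^{*} = \arg\min_{a}\ \alpha\,\mathrm{FNR}(a) + \mathrm{FPR}(a). \tag{6}$$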



3.3 Hash function application 

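Once the coefficients $B$ and threshold $a$ are computed, given a new data
point $x$, the corresponding $m$-dimensional binary hash vector is
constructed as
$\xi(x) = \mathrm{sign}(B\,(k_X(x_1,x), \dots, k_X(x_l,x))^{\mathrm T} + a)$.
Note that this embedding is kernel-dependent and has a more generic form than
the affine transformation used in [24, 25].

Algorithm 1: Kernel diff-hash algorithm.

Input: Positive set $\mathcal{P} \subseteq X \times X$, negative set
$\mathcal{N} \subseteq X \times X$; dimensionality of the hash $m$; kernel
$k_X$; set of vectors $x_1, \dots, x_l$.

Output: Optimal combination coefficient matrix $B$ of size $m \times l$;
optimal offset vector $a$ of size $m \times 1$.

1 Compute the kernel matrices $K_{\mathcal{P}}, K'_{\mathcal{P}}$ of size $l \times |\mathcal{P}|$ and $K_{\mathcal{N}}, K'_{\mathcal{N}}$ of size $l \times |\mathcal{N}|$, with elements $k_X(x_i, x)$ and $k_X(x_i, x')$ for the first and second points of each pair, respectively.
2 Compute the matrix $K = \frac{1}{|\mathcal{N}|} K_{\mathcal{N}} K_{\mathcal{N}}'^{\mathrm T} - \frac{\alpha}{|\mathcal{P}|} K_{\mathcal{P}} K_{\mathcal{P}}'^{\mathrm T}$.
3 Perform the eigendecomposition $K = V\Lambda V^{\mathrm T}$.
4 for $i = 1, \dots, m$ do
5     Set the $i$th row of the coefficient matrix $B$ to the eigenvector with the $i$th smallest eigenvalue, $\beta_i = \lambda_{l-i+1} v_{l-i+1}^{\mathrm T}$.
6     Compute the projection $p_i(x) = \beta_i\,(k_X(x_1,x), \dots, k_X(x_l,x))^{\mathrm T}$.
7     Compute the rates $\mathrm{FNR}(a_i)$ and $\mathrm{FPR}(a_i)$ for $p_i(x) + a_i$ as functions of the threshold $a_i$.
8     Compute the optimal threshold $a_i^{*} = \arg\min_{a}\ \alpha\,\mathrm{FNR}(a) + \mathrm{FPR}(a)$.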
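Applying a trained hash to new points amounts to one kernel evaluation against the $l$ basis points followed by a matrix product and a sign. The sketch below shows this step only; `kernel_hash` is a hypothetical name, the coefficients and offsets are random placeholders rather than learned values, and a linear kernel stands in for whichever kernel was used in training.

```python
import numpy as np

def kernel_hash(X, basis, B, a, kernel):
    """xi(x) = sign(B k(x) + a), with k(x) = (k(x_1, x), ..., k(x_l, x))^T.

    X: (N, n) query points, basis: (l, n), B: (m, l), a: (m,).
    Returns an (N, m) array of codes in {-1, +1}; zero maps to +1 (assumption).
    """
    Kx = kernel(basis, X)                                  # (l, N), one column per query
    return np.where(B @ Kx + a[:, None] >= 0, 1, -1).T

def linear_kernel(A, C):
    # Stand-in kernel for the demo only.
    return A @ C.T

rng = np.random.default_rng(0)
basis = rng.standard_normal((5, 3))   # l = 5 hypothetical representative points
B = rng.standard_normal((4, 5))       # placeholder coefficients, not learned
a = np.zeros(4)
Xq = rng.standard_normal((7, 3))
codes = kernel_hash(Xq, basis, B, a, linear_kernel)
```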

4 Results 

In order to test our approach, we applied it to the problem of image feature
matching. This problem is at the core of many modern Internet-scale computer
vision applications, including city-scale reconstruction p~|. The basic
underlying task in these problems, repeated millions and billions of times,
is the comparison of local image features (SIFT [18] or similar methods
[211 HI [26] ). Typically, these features are represented by means of
multidimensional descriptor vectors (e.g., SIFT is 128-dimensional) and
compared using the Euclidean distance. With very large datasets (containing
$10^6$ to $10^9$ feature points), severe scalability issues are encountered,
including problems of storage and similarity query on feature descriptors.
Efficient representation and comparison of feature descriptors have been
addressed in many recent works in the computer vision community (see, e.g.,
[201 [191 EH H21 [33l [Ml HH1 E] ). In [25], we proposed using
similarity-sensitive hashing methods to produce compact binary descriptors
[23]. Such descriptors have several appealing properties that make them
especially suitable for large-scale applications. First, they are compact
(typically 64 to 256 bits, compared to at least 1024 bits required for the
standard SIFT) and easy to store in standard databases. Second, the
comparison of binary descriptors is done using the Hamming metric, which
amounts to XOR and bit count, an operation that can be carried out extremely
efficiently on modern CPU architectures, significantly faster than the
computation of Euclidean or other $L_p$ distances. Finally, the construction
of the binarization transformations involves metric learning, thus modeling
more correctly the distance between the descriptors, which is usually
non-Euclidean. In particular, this makes it possible to compensate for
imperfect invariance of the descriptor (since viewpoint transformations are
only approximately locally affine) and to cope with descriptor variability in
pairs of images with wide baseline. As a result of this last property, the
use of similarity-sensitive hashing reduces the descriptor size while
actually improving its performance [25], unlike other methods, in which
compactness typically comes at the price of decreased performance.

Figure 1: Example of a positive (left, middle) and negative (left, right) pair of
image patches and corresponding descriptors. First row: patches, second row:
SIFT descriptors, third row: binary descriptors of length 32 produced using kDIF.
In our experiments, we used data from [33]. The datasets contained
rectified and normalized 64 × 64 patches extracted from multiple images
depicting three different scenes (Trevi fountain, Notre Dame cathedral, and
Half Dome). The first two scenes were similar, representing architectural
landmarks; the last scene was different, representing a natural mountain
environment. In each scene, a total of nearly 100K patches corresponding to
around 30K different feature points were available; each feature appeared
multiple times. For training, we used 100K pairs of patches corresponding
to different views of the same points as positives, and 200K pairs of patches
from different points as negatives (Figure 1). For testing, a different
subset of the dataset containing 50K positive and 50K negative pairs was used.

In each patch, a 128-dimensional (8 bits per dimension) SIFT descriptor
was computed using the toolbox of Vedaldi [29]. We compared the performance
of binary descriptors obtained by means of the diff-hash method of Strecha
et al. [25] (DIF) and our kernel version (kDIF). Diff-hash appeared to be
the best performing algorithm in an extensive set of evaluations done in
[25]. Since kDIF is an extended version of DIF, we chose to compare to this
method. In both methods, we used the value $\alpha = 25$, which was
experimentally found to produce the best results. In kDIF, we used
a Gaussian kernel with the Mahalanobis distance of the form
$k_X(x, x') = \exp\{-(x - x')^{\mathrm T}\Sigma^{-1}(x - x')\}$. The same
training and testing data were used for all methods. For reference, we show
the Euclidean distance between the original SIFT descriptors.
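A Gaussian kernel with a Mahalanobis distance can be evaluated by whitening the data with a Cholesky factor of the covariance matrix. The sketch below assumes the standard Mahalanobis form with an inverse covariance $\Sigma^{-1}$ and no additional scale factor; which covariance matrix is used and how it is estimated are not specified here, so `Sigma` is simply an input, and the function name is hypothetical.

```python
import numpy as np

def mahalanobis_gaussian_kernel(A, C, Sigma):
    """k(a, c) = exp(-(a - c)^T Sigma^{-1} (a - c)) for rows a of A and c of C.

    Sigma must be symmetric positive definite. With Sigma = L L^T, the
    Mahalanobis form equals the squared Euclidean norm of L^{-1} (a - c).
    """
    L = np.linalg.cholesky(Sigma)
    Aw = np.linalg.solve(L, A.T).T     # whitened rows L^{-1} a
    Cw = np.linalg.solve(L, C.T).T
    sq = np.sum(Aw**2, axis=1)[:, None] + np.sum(Cw**2, axis=1)[None, :] - 2.0 * Aw @ Cw.T
    return np.exp(-np.maximum(sq, 0.0))   # clip tiny negative round-off

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
M = rng.standard_normal((3, 3))
Sigma = M @ M.T + np.eye(3)            # arbitrary positive definite matrix for the demo
K = mahalanobis_gaussian_kernel(A, A, Sigma)
diff = A[0] - A[1]
direct = np.exp(-diff @ np.linalg.solve(Sigma, diff))   # direct evaluation for one pair
```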

Figures 2 and 3 show the performance of different hashing algorithms as a
function of $m$ on different datasets. Several conclusions can be drawn from
these figures. First, kDIF appears to consistently outperform DIF on all
three scenes for the same hash length $m$. Second, for sufficiently large
$m$, our method outperforms SIFT while still being more compact. Third, the
learned hashing functions generalize gracefully to other scenes, though
slight performance degradation is noticeable when training on a mountain
scene (Half Dome) and using the learned hash in an architectural scene
(Notre Dame).

Figure 4 compares the performance of different descriptors in terms of
FNR at two low FPR points (0.1% and 0.01%). First, binary descriptors
outperform raw SIFT while being 2 to 4 times more compact (to say nothing of
the lower computational complexity of the Hamming distance compared to the
Euclidean distance). Second, kDIF consistently outperforms DIF. Third, one
can see that using a longer hash (m > 128) increases the performance.
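The figure of merit used here, FNR at a fixed low FPR, can be computed from the distances of labeled test pairs roughly as follows. The function name and the convention that a pair is declared a match when its distance does not exceed the threshold are assumptions of this sketch, and the toy distances are made up for illustration.

```python
import numpy as np

def fnr_at_fpr(pos_dist, neg_dist, target_fpr):
    """Smallest FNR over distance thresholds whose FPR does not exceed target_fpr.

    A pair is declared a match when its distance is <= threshold.
    pos_dist: distances of positive pairs, neg_dist: distances of negative pairs.
    """
    thresholds = np.unique(np.concatenate([pos_dist, neg_dist]))
    best_fnr = 1.0   # a threshold below all distances rejects every pair
    for t in thresholds:
        fpr = np.mean(neg_dist <= t)           # negatives wrongly accepted
        if fpr <= target_fpr:
            best_fnr = min(best_fnr, np.mean(pos_dist > t))   # positives wrongly rejected
    return best_fnr

pos = np.array([0, 1, 1, 2, 5])
neg = np.array([3, 4, 6, 7, 8, 9, 10, 11, 12, 13])
fnr_10 = fnr_at_fpr(pos, neg, 0.1)
fnr_20 = fnr_at_fpr(pos, neg, 0.2)
```

The same function applies to Hamming distances between hashes and to Euclidean distances between raw descriptors, which allows the two to be compared at the same operating point.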

Figure 5 shows a few examples of first matches between patch descriptors
obtained using the Euclidean distance and the Hamming distance on the hashed
descriptors using our method. Our method provides superior performance.

5 Conclusions 

We presented a kernel formulation of the diff-hash similarity-sensitive
hashing algorithm and showed how this method can be used to produce efficient
and compact binary feature descriptors. Though we showed results with SIFT,
the method is generic and can be applied to any local feature descriptor. Our
method showed superior results compared to the original diff-hash proposed
in [25], and is more generic, as it allows hashes of any length to be
obtained and nonlinearity to be incorporated through the choice of the kernel.
Figure 2: ROC curves showing the performance of Euclidean distance between
SIFT descriptors (dashed black) and Hamming distance between binary vectors of 
different dimension m = 32, 64, . . . , 512 constructed using DIF (dash-dot red) and 
kDIF (solid blue) hashing algorithms. Captions follow the convention training-test. 



[Plot residue. Legend: Eucl, DIF, kDIF. Horizontal axis: false positive rate (logarithmic, $10^{-4}$ to $10^{0}$). Panels: (a) trevi-trevi, (b) trevi-notredame.]

Figure 3: ROC curves showing the performance of Euclidean distance between 
SIFT descriptors (dashed black) and Hamming distance between binary vectors of 
different dimension m = 32, 64, . . . , 512 constructed using DIF (dash-dot red) and 
kDIF (solid blue) hashing algorithms. Captions follow the convention training-test. 
Figure 4: Performance (FNR at 0.1% and 0.01% FPR; the smaller the better) of
different methods as a function of descriptor size in bits. Training was done
on the trevi dataset; testing on the notredame dataset.
Figure 5: First matches using Euclidean distance between SIFT descriptors (odd
rows) and Hamming distance between 512-dimensional binary vectors constructed
using our kDIF hashing algorithm (even rows). The query image is shown on the
left; the first five matches are shown on the right. Numbers indicate the
distance from the query. Wrong matches are marked in red, correct matches are
marked in green.